Retrieving website data via HTTP

This small demo program shows how to fetch the contents of a web page. All links are then extracted from the retrieved data and displayed.

We use the class CL_HTTP_CLIENT to fetch the source code of a web page. The URL must be specified in full, including the scheme (http:// or https://).
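Because the URL has to include the scheme, it can help to check the input before creating the HTTP client. The following is only a minimal sketch and not part of the original program; the report name zdemo_check_url and the message texts are assumptions made for illustration.

```abap
REPORT zdemo_check_url.

PARAMETERS p_url TYPE string LOWER CASE
                 DEFAULT 'https://tricktresor.de/'.

START-OF-SELECTION.

* CP compares against a pattern: '*' stands for any character
* sequence, and the comparison is not case-sensitive.
  IF p_url CP 'http://*' OR p_url CP 'https://*'.
    WRITE: / 'URL looks complete:', p_url.
  ELSE.
    WRITE: / 'Please enter the URL including http:// or https://'.
  ENDIF.
```

The check only looks at the prefix; it does not validate the rest of the URL.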

Code

*:: Selection screen
PARAMETERS p_url TYPE string LOWER CASE
                 DEFAULT 'https://tricktresor.de/wp-content/index.php?aID=0'.

START-OF-SELECTION.

  PERFORM get_urls USING p_url.

*&---------------------------------------------------------------------*
*&      Form  GET_URLS
*&---------------------------------------------------------------------*
FORM get_urls  USING  iv_url TYPE clike.

*:: local data
  DATA lv_http_url     TYPE string.
  DATA lr_http_client  TYPE REF TO if_http_client.
  DATA lv_html_code    TYPE string.

  DATA lt_urls         TYPE STANDARD TABLE OF string
                       WITH NON-UNIQUE DEFAULT KEY.
  DATA lt_new          LIKE lt_urls.
  DATA lv_regex        TYPE string.

  DATA lv_url          TYPE string.
  DATA lv_dummy1       TYPE string.
  DATA lv_dummy2       TYPE string.

  STATICS lt_list      TYPE HASHED TABLE OF string
                       WITH UNIQUE KEY table_line.

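*:: create HTTP client for the url
*   (copy to a string first: parameter URL expects type STRING)
  lv_http_url = iv_url.
  CALL METHOD cl_http_client=>create_by_url
    EXPORTING
      url                = lv_http_url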
    IMPORTING
      client             = lr_http_client
    EXCEPTIONS
      argument_not_found = 1
      plugin_not_active  = 2
      internal_error     = 3
      OTHERS             = 4.
  IF sy-subrc > 0.
*:: error
    WRITE: AT 40 'Unable to create url, Sy-Subrc:', sy-subrc.
    RETURN.
  ENDIF.

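*:: Send out request
  CALL METHOD lr_http_client->send
    EXCEPTIONS
      http_communication_failure = 1
      http_invalid_state         = 2
      http_processing_failed     = 3
      http_invalid_timeout       = 4
      OTHERS                     = 5.
  IF sy-subrc <> 0.
*:: error
    WRITE: AT 40 'Unable to send request, Sy-Subrc:', sy-subrc.
    RETURN.
  ENDIF.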

*:: Receive result as stream
  CALL METHOD lr_http_client->receive
    EXCEPTIONS
      http_communication_failure = 1
      http_invalid_state         = 2
      http_processing_failed     = 3
      OTHERS                     = 4.

  IF sy-subrc <> 0.
*:: error
    WRITE: AT 40 'Unable to read data, Sy-Subrc:', sy-subrc.
  ELSE.
*:: Get sourcecode
    lv_html_code = lr_http_client->response->get_cdata( ).
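
*:: Optional sketch, not part of the original program: a successful
*   receive does not guarantee a successful response (e.g. 404), so
*   the HTTP status can be inspected here. The variable names
*   lv_status_code and lv_status_text are assumptions.
    DATA lv_status_code TYPE i.
    DATA lv_status_text TYPE string.
    lr_http_client->response->get_status(
      IMPORTING
        code   = lv_status_code
        reason = lv_status_text ).
    IF lv_status_code <> 200.
      WRITE: / 'HTTP status:', lv_status_code, lv_status_text.
    ENDIF.
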
    WRITE:/ iv_url COLOR 5.

*:: simple method - Find urls
    SPLIT lv_html_code AT 'href=' INTO TABLE lt_urls.
    LOOP AT lt_urls INTO lv_url.
      FORMAT COLOR OFF.
      CHECK lv_url IS NOT INITIAL.
      CHECK lv_url(1) = `"`
      OR lv_url(1) = `'`.
      FIND lv_url(1) IN lv_url+1 MATCH OFFSET sy-fdpos.
      CHECK sy-subrc = 0.
      lv_url = lv_url+1(sy-fdpos).

      IF lv_url IS INITIAL.
        CONCATENATE iv_url lv_url INTO lv_url.

      ELSEIF lv_url(1) = '#'.
        CONCATENATE iv_url lv_url INTO lv_url.

      ELSEIF lv_url(1) = '/'.  "Root
        FORMAT COLOR COL_GROUP.

      ELSEIF lv_url(1) = '?'.
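        "Replace the query string of the base URL with the one from the link.
        "If the base URL has no '?', lv_dummy1 receives the whole URL.
        SPLIT iv_url AT '?' INTO lv_dummy1 lv_dummy2.
        CONCATENATE lv_dummy1 lv_url INTO lv_url.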
      "CP ignores case and, unlike lv_url(5), is safe for links
      "shorter than the prefix being tested
      ELSEIF lv_url CP 'https*'.
        FORMAT COLOR COL_POSITIVE.
      ELSEIF lv_url CP 'http*'.
        FORMAT COLOR COL_NORMAL.
      ENDIF.

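*:: check whether the link starts with the main URL
*   (plain substring search, because the URL contains characters
*   such as '.' and '?' that have a special meaning in a regex)
      FIND iv_url IN lv_url MATCH OFFSET sy-fdpos.
      IF sy-subrc = 0 AND sy-fdpos = 0.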
        FORMAT INTENSIFIED ON.
      ELSE.
        FORMAT INTENSIFIED OFF.
      ENDIF.
      WRITE: /10 lv_url.

    ENDLOOP.
    ULINE.

  ENDIF.

*:: Release the connection
  CALL METHOD lr_http_client->close
    EXCEPTIONS
      OTHERS = 1.

ENDFORM.                    " GET_URLS
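
The form above finds links by splitting the source at 'href='. As an alternative, the links could be collected with a regular expression. The following is only a sketch under assumptions: the report name zdemo_find_links and the sample markup are hypothetical, and the pattern only covers href attributes in double quotes.

```abap
REPORT zdemo_find_links.

DATA lv_html    TYPE string.
DATA lt_matches TYPE match_result_tab.
DATA ls_match   TYPE match_result.
DATA ls_sub     TYPE submatch_result.
DATA lv_link    TYPE string.

START-OF-SELECTION.

* Hypothetical sample markup standing in for a downloaded page
  lv_html = `<a href="/home">Home</a> <a href="https://example.com/x">X</a>`.

* The parenthesized group captures the text between the quotes
  FIND ALL OCCURRENCES OF REGEX `href="([^"]*)"`
       IN lv_html RESULTS lt_matches.

  LOOP AT lt_matches INTO ls_match.
*   The first submatch describes offset and length of the captured link
    READ TABLE ls_match-submatches INTO ls_sub INDEX 1.
    CHECK sy-subrc = 0 AND ls_sub-length > 0.
    lv_link = lv_html+ls_sub-offset(ls_sub-length).
    WRITE: / lv_link.
  ENDLOOP.
```

Compared with the SPLIT approach, this skips empty links and never has to search for the closing quote separately, but it misses links written with single quotes unless the pattern is extended.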

 

Enno Wulff
